Welcome to the third part of our 16SrRNA Amplicon Sequencing. Here, we will take a look into the diversity of ASVs within each of the samples, and squeeze this dataset to extract as much information as possible from it.
The way we compare microbial communities is using tools from community ecology, and we have heaps of theoretical and practical knowledge about them because they were originally developed for ecology of macroorganisms. With R, calculating the metrics is much easier than understanding what they mean, so pay attention to what each metric means when you’re interpreting the results!
Please take your time to read all instructions carefully, prepare your scripts and understand what you want to achieve with each command.
Before we begin, set the paths in your local environment on the server by replacing these paths with the ones that point to the directory where you want to store the results from today’s tutorial, and to the phyloseq object you saved in the previous part 2. We will also need to reload the Metadata table, which has been cleaned (Metadata_kefir_modified.csv).
#set path in plain
PathToWD <- (".")
knitr::opts_knit$set(root.dir = normalizePath(PathToWD)) PathToWD <- ("/Users/admin/Documents/UNIL/Enseignement/SAGE/SAGE2")
path2ps <- paste(PathToWD, "04_Phyloseq_object/", sep="/")
setwd(path2ps)
ps <- readRDS("PhyloSeq_Object.rds")
metaKefir <- read.table(file="Metadata_kefir_modified.csv",sep=",",header=TRUE)
rownames(metaKefir) <- metaKefir$id_sample
path2dv <- paste(PathToWD, "05_Diversity", sep="/")
dir.create(path2dv)
setwd(path2dv)We will use the phyloseq package, which was specifically developed for the analysis of microbial communities. We will need also some extensions and other packages.
library(phyloseq)
library(phyloseq.extended)
library(ggplot2)
library(ggpubr)
library(genefilter)
library(vegan)
library(ampvis)To begin, you will need the phyloseq object you should have constructed at the end of the Part 2. You’ve already loaded it above, but here is the code again in case you need to reload a fresh one:
setwd(path2ps)
ps <- readRDS("PhyloSeq_Object.rds")
sample_data(ps) <- metaKefir
path2dv <- paste(PathToWD, "05_Diversity", sep="/")
setwd(path2dv)
# Your phyloseq object is now in *ps*
sample_data(ps) <- metaKefir
ps## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 3943 taxa and 35 samples ]
## sample_data() Sample Data: [ 35 samples by 12 sample variables ]
## tax_table() Taxonomy Table: [ 3943 taxa by 7 taxonomic ranks ]
## refseq() DNAStringSet: [ 3943 reference sequences ]
This is a special R object that must contain the following:
Since the phyloseq object is a special kind of object, each data object will be accesible differently:
otu_table(ps)
sample_data(ps)
tax_table(ps)phyloseq provides a set of functions to quickly retrieve useful information from your phyloseq object… they are self-explicative, so try them and figure out what they do:
library(phyloseq)
ps## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 3943 taxa and 35 samples ]
## sample_data() Sample Data: [ 35 samples by 12 sample variables ]
## tax_table() Taxonomy Table: [ 3943 taxa by 7 taxonomic ranks ]
## refseq() DNAStringSet: [ 3943 reference sequences ]
rank_names(ps)## [1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species"
sample_variables(ps)## [1] "id_sample" "treatment" "DNAconc" "inoculum"
## [5] "fruit" "time" "label" "NovoID"
## [9] "student_Number" "Family_Name" "First_name" "AllFactor"
get_variable(ps) ntaxa(ps)## [1] 3943
nsamples(ps)## [1] 35
sample_names(ps)## [1] "K1" "K10" "K11" "K12" "K13" "K14" "K15" "K16" "K17" "K18" "K19" "K2"
## [13] "K20" "K21" "K22" "K23" "K24" "K25" "K26" "K27" "K28" "K29" "K3" "K30"
## [25] "K31" "K32" "K33" "K34" "K35" "K4" "K5" "K6" "K7" "K8" "K9"
taxa_names(ps)[1:20] # Just the first 20## [1] "ASV1" "ASV2" "ASV3" "ASV4" "ASV5" "ASV6" "ASV7" "ASV8" "ASV9"
## [10] "ASV10" "ASV11" "ASV12" "ASV13" "ASV14" "ASV15" "ASV16" "ASV17" "ASV18"
## [19] "ASV19" "ASV20"
sample_sums(ps)## K1 K10 K11 K12 K13 K14 K15 K16 K17 K18 K19
## 101652 69731 96825 92202 87283 68179 64264 95494 94088 66573 78783
## K2 K20 K21 K22 K23 K24 K25 K26 K27 K28 K29
## 88323 96428 82532 85009 81102 76271 98489 96448 100423 100134 72124
## K3 K30 K31 K32 K33 K34 K35 K4 K5 K6 K7
## 85544 83242 76512 81086 78832 89789 68099 89441 93646 63712 96747
## K8 K9
## 101635 92416
taxa_sums(ps)[1:20] # Just the first 20## ASV1 ASV2 ASV3 ASV4 ASV5 ASV6 ASV7 ASV8 ASV9 ASV10
## 1402056 668284 169362 122123 91989 85013 51502 32011 26215 25768
## ASV11 ASV12 ASV13 ASV14 ASV15 ASV16 ASV17 ASV18 ASV19 ASV20
## 24726 16148 14744 14068 12061 8876 7889 6810 6072 4764
get_taxa(ps,"K11")[1:20] # Just the furst 20 ASV abundances for ONE sample## ASV1 ASV2 ASV3 ASV4 ASV5 ASV6 ASV7 ASV8 ASV9 ASV10 ASV11 ASV12 ASV13
## 21946 45282 6003 3736 2367 7282 239 2134 384 0 410 220 409
## ASV14 ASV15 ASV16 ASV17 ASV18 ASV19 ASV20
## 31 165 316 537 26 19 121
get_sample(ps,"ASV1") # Abundances in ALL samples for ONE OTU. ## K1 K10 K11 K12 K13 K14 K15 K16 K17 K18 K19 K2 K20
## 46697 23264 21946 22921 28854 31401 31853 72118 50449 32366 28492 41991 58425
## K21 K22 K23 K24 K25 K26 K27 K28 K29 K3 K30 K31 K32
## 38411 37748 10634 22326 55201 54925 62010 19252 38184 53705 51182 51895 59333
## K33 K34 K35 K4 K5 K6 K7 K8 K9
## 46004 33661 37910 29042 57836 43712 47787 44992 15529
You can manipulate these as common R objects:
class( sample_sums(ps) )## [1] "numeric"
str( sample_sums(ps) )## Named num [1:35] 101652 69731 96825 92202 87283 ...
## - attr(*, "names")= chr [1:35] "K1" "K10" "K11" "K12" ...
head(sort(sample_sums(ps),decreasing = TRUE),35)## K1 K8 K27 K28 K25 K11 K7 K26 K20 K16 K17
## 101652 101635 100423 100134 98489 96825 96747 96448 96428 95494 94088
## K5 K9 K12 K34 K4 K2 K13 K3 K22 K30 K21
## 93646 92416 92202 89789 89441 88323 87283 85544 85009 83242 82532
## K23 K32 K33 K19 K31 K24 K29 K10 K14 K35 K18
## 81102 81086 78832 78783 76512 76271 72124 69731 68179 68099 66573
## K15 K6
## 64264 63712
sample_sums(ps)["K11"]## K11
## 96825
taxa_sums(ps)[1:15]## ASV1 ASV2 ASV3 ASV4 ASV5 ASV6 ASV7 ASV8 ASV9 ASV10
## 1402056 668284 169362 122123 91989 85013 51502 32011 26215 25768
## ASV11 ASV12 ASV13 ASV14 ASV15
## 24726 16148 14744 14068 12061
taxa_sums(ps)[c("ASV1","ASV10","ASV100")]## ASV1 ASV10 ASV100
## 1402056 25768 420
sample_data(ps)$AllFactor <- rep("ALL",35)We now have everything we need to start exploring our dataset!
We will first need to get rid of one sample that corresponds to Vladimir’s kefir, from which Philipp’s and Vincent’s kefir originate. Vladimir’s kefir composition will not be studied in this tutorial.
ps = subset_samples(ps, First_name != "Vladimir")
ps## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 3943 taxa and 34 samples ]
## sample_data() Sample Data: [ 34 samples by 12 sample variables ]
## tax_table() Taxonomy Table: [ 3943 taxa by 7 taxonomic ranks ]
## refseq() DNAStringSet: [ 3943 reference sequences ]
We also need to filter a few taxa that do not correspond to bacterial ASVs, like chloroplasts and mitochondria.
A PCR reaction can effectively amplify templates even more diluted than 1 copy per 50 uL. Unfortunately, this means that it can also amplify undesired DNA from contaminant genomic DNA present in the sample, such as host’s mitochondria and chloroplasts, bacteria in transit in the environment or human contamination due to sample handling. The effect will be even larger in samples with very low biomass of the desired community.
There are many ways to control for contaminants, but none is standardized. Monitoring total DNA extracted before preparing the libraries is a good way to detect low-yield samples. Running blank or ghost DNA extractions (without the desired tissue or sample) are good to detect contaminants in reagents.
First of all, we will search for sequences assigned to mitochondria from the yeasts or chloroplasts from the figs and raisins.
rank_names(ps)## [1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species"
# Check if there are any ASVs classified as eukaryotic or NA
get_taxa_unique(ps,"Kingdom")## [1] "Bacteria" NA "Archaea"
# Remove ASVs assigned to chloroplasts or mitochondria
ps <- subset_taxa(ps, Kingdom != "Archaea" | is.na(Kingdom))
ps <- subset_taxa(ps, Order != "Chloroplast" | is.na(Order))
ps <- subset_taxa(ps, Family != "Mitochrondria" | is.na(Family))Your object ps should not contain any ASVS classified as chloroplasts or mitochondria.
Differences in sampling effort (sequening depth) can strongly affect diversity analyses, because samples that appear to have the same diversity might actually be sampled at different depths.
One technique to evaluate and compare the sampling depth of a community is rarefaction. Rarefaction will subsample each sample at different sequencing depths, and estimate how many ASVs you would have if you had sampled at that specific depth. A cumulative rarefaction curve then is plotted by joining the expectations at different depths. As such, it is also a measure of estimated richness.
ps## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 3882 taxa and 34 samples ]
## sample_data() Sample Data: [ 34 samples by 12 sample variables ]
## tax_table() Taxonomy Table: [ 3882 taxa by 7 taxonomic ranks ]
## refseq() DNAStringSet: [ 3882 reference sequences ]
pl.rare.all <- ggrare(ps, step = 1000, label = "Sample", se = FALSE) + xlab("Sequencing Depth") + ylab("ASV Richness")## rarefying sample K1
## rarefying sample K10
## rarefying sample K11
## rarefying sample K12
## rarefying sample K13
## rarefying sample K14
## rarefying sample K15
## rarefying sample K16
## rarefying sample K17
## rarefying sample K18
## rarefying sample K19
## rarefying sample K2
## rarefying sample K20
## rarefying sample K21
## rarefying sample K22
## rarefying sample K23
## rarefying sample K24
## rarefying sample K25
## rarefying sample K26
## rarefying sample K27
## rarefying sample K28
## rarefying sample K29
## rarefying sample K3
## rarefying sample K30
## rarefying sample K31
## rarefying sample K32
## rarefying sample K33
## rarefying sample K34
## rarefying sample K4
## rarefying sample K5
## rarefying sample K6
## rarefying sample K7
## rarefying sample K8
## rarefying sample K9
What you need to pay attention to:
What have you learnt from the rarefaction curve?
Here, because the samples are not too heterogeneous in their sequencing depths, and because the smaller samples (K6 and K15) have a sequencing depth at which all other samples have reached their plateau, there is no need to rarefy, we would just lose information.
Another way to infer how many sampled bacteria were in transit (randomly sampled from the environment), is to count in how many samples each ASV is found with an abundance larger than zero. We call this the prevalence of an ASV.
In brief, those bacteria that are central components essential for your ecosystem will be present in all samples, and in relatively large proportions. ASVs from accidentally sampled bacteria, contaminants or artifacts will have a prevalence around 1 and generally in low proportions.
# Calculate prevalence across all samples
ps.prev <- estimate_prevalence(ps,group="AllFactor",rarefy = FALSE)
ps.prev$samplePrev <- ps.prev$prevalence * 35 # Convert to absolute sample numbers
ps.prev$relAbund <- ( ps.prev$abundance / sum(ps.prev$abundance) ) # Calculate relative abundance
#Plot
pl.prev.his <- ggplot(ps.prev, aes(x=samplePrev))+
geom_histogram(colour="steelblue",fill="steelblue", binwidth = 1, boundary = -0.5) +
scale_x_continuous(breaks = 1:35) +
xlab("Number of samples in which ASVs are found") +
ylab("Number of ASVs") +
stat_bin(binwidth=1, geom='text', colour="black", size = 2, aes(label=..count..), position=position_stack(vjust = 1.1))
pl.prev.his## Compare prevalence with relative abundance
# select the "Phylum" taxonomic assignation for your ASVs
ps.tax.class <- data.frame(tax_table(ps)[,2])
# Plot
pl.prev.sct <- ggplot(ps.prev, aes(x=samplePrev, y=relAbund, color=ps.tax.class$Phylum)) +
geom_point() +
theme(legend.title=element_blank(),legend.text = element_text(size = 8),
legend.position="bottom") +
xlab("Number of samples in which ASVs are found") +
ylab("Relative abundance")
pl.prev.sctMost of what we’ve done so far depends on the number of ASVs we have per sample or richness. The richness of a sample can be affected by sequencing effort, DNA yields, volume sampled and many other factors. Let’s calculate ASV richness and compare it to the total number of reads per sample.
# Calculate how many total ASVs
tmp_df <- t(otu_table(ps))
ps.asv <- data.frame(colSums(tmp_df!=0))
colnames(ps.asv) <- "ASVs"
rm(tmp_df)
# Now add the total read number
ps.asv$reads <- sample_sums(ps)
colnames(ps.asv) <- c("ASVs", "reads")
ps.asv$sampleID <- rownames(ps.asv)
# Make barplots for each sample:
pl.asv.bar <- ggplot(ps.asv, aes(x = sampleID, y = ASVs) ) +
ggtitle("Total ASVs per sample") +
ylab("ASV number") +
geom_bar(stat="identity", fill="steelblue")+
theme(axis.text.x = element_text(angle=90))
pl.red.bar <- ggplot(ps.asv, aes(x = sampleID, y = reads) ) +
ggtitle("Total reads per sample") +
ylab("Read number") +
geom_bar(stat="identity", fill="indianred")+
theme(axis.text.x = element_text(angle=90))
# Histogram of ASVs and read counts
pl.asv.his <- ggplot(ps.asv, aes(x = reads)) +
geom_histogram(color = "black", fill = "indianred", binwidth = 5000) +
ggtitle("Total reads distribution") +
xlab("Reads") +
ylab("Samples")+
theme(axis.text.x = element_text(angle=90))
pl.red.his <- ggplot(ps.asv, aes(x = ASVs)) +
geom_histogram(color = "black", fill = "steelblue", binwidth = 50) +
ggtitle("Total ASV distribution") +
xlab("ASVs") +
ylab("Samples")+
theme(axis.text.x = element_text(angle=90))
# Visualize
ggarrange(pl.red.bar, pl.asv.bar, pl.asv.his, pl.red.his, ncol=2, nrow=2)summary(ps.asv)## ASVs reads sampleID
## Min. : 35.0 Min. : 63418 Length:34
## 1st Qu.: 78.0 1st Qu.: 78697 Class :character
## Median :156.5 Median : 86350 Mode :character
## Mean :185.6 Mean : 85758
## 3rd Qu.:251.8 3rd Qu.: 95910
## Max. :498.0 Max. :101518
It looks like we might be seeing more ASVs in samples that were sequenced more deeply (yielded more reads). Let’s see if we can have a better picture of this:
# Compare ASVs against Reads
summary(lm(ps.asv$ASVs ~ ps.asv$reads))##
## Call:
## lm(formula = ps.asv$ASVs ~ ps.asv$reads)
##
## Residuals:
## Min 1Q Median 3Q Max
## -155.91 -86.74 -40.08 77.89 301.23
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -65.713512 161.642057 -0.407 0.687
## ps.asv$reads 0.002931 0.001869 1.568 0.127
##
## Residual standard error: 122.4 on 32 degrees of freedom
## Multiple R-squared: 0.07136, Adjusted R-squared: 0.04234
## F-statistic: 2.459 on 1 and 32 DF, p-value: 0.1267
pl.asv_red.scp <- ggplot(ps.asv, aes(x=reads, y=ASVs)) +
geom_point()+
geom_smooth(method=lm)
## Compare ASVs against DNA yields
# Add DNA yields from the sample metadata
tmp1 <- sample_data(ps)[,"DNAconc"]
ps.asv2 <- merge(ps.asv, tmp1, by="row.names", all=TRUE)
rownames(ps.asv2) <- ps.asv2$sampleID
ps.asv2[,"Row.names"] <- NULL
# Compare ASVs against DNA yield
summary(lm(ps.asv2$ASVs ~ ps.asv2$DNAconc))##
## Call:
## lm(formula = ps.asv2$ASVs ~ ps.asv2$DNAconc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -150.35 -110.41 -25.80 66.65 312.55
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 177.1807 38.6265 4.587 6.57e-05 ***
## ps.asv2$DNAconc 0.1133 0.4285 0.264 0.793
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 126.8 on 32 degrees of freedom
## Multiple R-squared: 0.002179, Adjusted R-squared: -0.029
## F-statistic: 0.06987 on 1 and 32 DF, p-value: 0.7932
pl.asv_dna.scp <- ggplot(ps.asv2, aes(x=DNAconc, y=ASVs)) +
geom_point()+
geom_smooth(method=lm)
rm(tmp1, tmp2)
# Visualize
ggarrange(pl.asv_dna.scp, pl.asv_red.scp, ncol=2, nrow=1)Now let’s see what is the composition of each sample and how similar it is to what has been previously published.
## First lets count how many Phyla are in each sample
get_taxa_unique(ps,taxonomic.rank = "Phylum")## [1] "Firmicutes" "Actinobacteriota"
## [3] "Proteobacteria" "Desulfobacterota"
## [5] "Bacteroidota" "Verrucomicrobiota"
## [7] "Cyanobacteria" "Campilobacterota"
## [9] "Deferribacterota" NA
## [11] "Marinimicrobia_(SAR406_clade)" "SAR324_clade(Marine_group_B)"
## [13] "Planctomycetota" "Acidobacteriota"
## [15] "Chloroflexi" "RCP2-54"
## [17] "Dadabacteria" "Fibrobacterota"
## [19] "Abditibacteriota" "NB1-j"
## [21] "Nitrospirota" "Gemmatimonadota"
## [23] "Myxococcota" "Bdellovibrionota"
## [25] "Armatimonadota" "Fusobacteriota"
## [27] "Spirochaetota" "Entotheonellaeota"
## [29] "Synergistota" "WPS-2"
## [31] "Elusimicrobiota" "FCPU426"
## [33] "Nitrospinota" "Methylomirabilota"
## [35] "PAUC34f" "Latescibacterota"
## [37] "Patescibacteria" "Deinococcota"
## [39] "Deferrisomatota" "Calditrichota"
## [41] "Acetothermia" "Dependentiae"
## [43] "Sva0485"
# Let''s look at the distribution of phyla in each sample
ps.phy <- tax_glom(ps, taxrank="Phylum")
ps.phy.df <- psmelt(ps.phy)
# Plot each treatment group next to each other
ps.phy.df %>%
ggplot(aes_string(x = "Sample", y = "Abundance", fill = "Phylum")) +
geom_bar(stat = "identity", position = "stack") +
theme_bw() +
theme(legend.position = "bottom", axis.text.x = element_text(angle = 80, vjust = 0.5)) +
ggtitle("Relative abundance at the phylum level")# Now let's do the same thing at the genus level, displaying only the 20 most abundant genera
# select top 10 most prevalent Families over the whole dataset
ps.gen <- tax_glom(ps, taxrank="Genus")
Top10_gen <- names(sort(taxa_sums(ps.gen), TRUE)[1:10])
ps.gen <- prune_taxa(Top10_gen, ps.gen)
# Melt them for plotting
ps.gen.df <- psmelt(ps.gen)
# Plot each treatment group next to each other
ps.gen.df %>%
ggplot(aes_string(x = "Sample", y = "Abundance", fill = "Genus")) +
geom_bar(stat = "identity", position = "stack") +
theme_bw() +
theme(legend.position = "bottom", axis.text.x = element_text(angle = 80, vjust = 0.5)) +
ggtitle("Relative abundance at the Genus level")By now you probably realized that there are lots of low-prevalence and low-abundance (rare) ASVs in our samples, that are either contaminats or simply noise in our dataset. Thus we will now filter ASVs that are too rare and in very few samples, and retain those with larger abundance and prevalence.
To counter the differences in sequencing depth, we will transform our data with absolute read counts to proportions, where the sum of all reads per sample equals 1. Then we will retain only ASVs whose relative abundance is larger than 1% (0.01) in at least 1 sample. In this way, we can keep ASVs that are relatively abundant in a few samples too.
library(genefilter)
ps.prop = transform_sample_counts(ps, function(x) {x/sum(x)})
ps.filt = filter_taxa(ps.prop, filterfun(kOverA(1, 0.01)), TRUE)
# Then save an object with the same ASVs but with raw counts
toKeep <- taxa_names(ps.filt)
ps.rawfilt <- prune_taxa(toKeep, ps)
# Fix tax_table
tmptax <- data.frame(tax_table(ps.filt))
tmptax$Linnaeus <- paste(tmptax$Genus,tmptax$Species )
tax_table(ps.filt) <- as.matrix(tmptax)
tax_table(ps.rawfilt) <- as.matrix(tmptax)
# Plot again at Genus level
plot_bar(ps.filt, x="id_sample",fill="Genus") plot_bar(ps.rawfilt, x="id_sample",fill="Genus") # Plot at species level
plot_bar(ps.filt, x="id_sample",fill="Linnaeus" ) +
theme(legend.title=element_blank(),legend.text = element_text(size = 8), legend.position="right")+
guides(color=guide_legend(ncol=1, byrow=FALSE))plot_bar(ps.rawfilt, x="id_sample",fill="Linnaeus" ) +
theme(legend.title=element_blank(),legend.text = element_text(size = 8), legend.position="right")+
guides(color=guide_legend(ncol=1, byrow=FALSE))Now that we have cleaned and filtered our dataset, we can perform the proper diversity analyses.
Let’s first have a look at the local, individual, diversity Alpha diversity in the samples.
Remember that we want to answer three questions: Question 1: How similar are the communities in the two kefir starter cultures? Question 2: How stable are the communities over time? Question 3: Do they change in response to the environment?
#ps.rawfilt.q1 = subset_samples(ps.rawfilt, time!="END")
ps.q3 = subset_samples(ps, time!="START")
ps.filt.q3 = subset_samples(ps.filt, time!="START")
ps.rawfilt.q3 = subset_samples(ps.rawfilt, time!="START")# Question1: How similar are the communities in the two kefir starter cultures?
pl.alpha <- plot_richness(ps.rawfilt, color = "inoculum", x="inoculum", measures=c("Observed","Shannon", "Simpson"))
pl.alpha.box <- pl.alpha + geom_boxplot()
plot(pl.alpha.box)# Let's run a simple ANOVA on alpha diversity indices
ps.alpha <- estimate_richness(ps.rawfilt, measures=c("Observed","Shannon", "Simpson"))
ps.dataA <- cbind(sample_data(ps.rawfilt), ps.alpha)
ps.dataA$reads <- sample_sums(ps)
ps.alpha.aov <- aov(Observed ~ inoculum + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)## Df Sum Sq Mean Sq F value Pr(>F)
## inoculum 1 17.50 17.50 3.963 0.0557 .
## reads 1 31.69 31.69 7.175 0.0119 *
## DNAconc 1 5.76 5.76 1.305 0.2624
## Residuals 30 132.49 4.42
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ps.alpha.aov <- aov(Shannon ~ inoculum + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)## Df Sum Sq Mean Sq F value Pr(>F)
## inoculum 1 0.4820 0.4820 6.150 0.019 *
## reads 1 0.0001 0.0001 0.002 0.969
## DNAconc 1 0.1896 0.1896 2.418 0.130
## Residuals 30 2.3515 0.0784
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ps.alpha.aov <- aov(Simpson ~ inoculum + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)## Df Sum Sq Mean Sq F value Pr(>F)
## inoculum 1 0.0682 0.06822 5.444 0.0265 *
## reads 1 0.0019 0.00188 0.150 0.7010
## DNAconc 1 0.0253 0.02534 2.022 0.1654
## Residuals 30 0.3760 0.01253
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Question 2: How stable are the communities over time?
pl.alpha <- plot_richness(ps.rawfilt, color = "time", x="time", measures=c("Observed","Shannon", "Simpson"))
pl.alpha.box <- pl.alpha + geom_boxplot()
plot(pl.alpha.box)# Let's run a simple ANOVA on alpha diversity indices
ps.alpha <- estimate_richness(ps.rawfilt, measures=c("Observed","Shannon", "Simpson"))
ps.dataA <- cbind(sample_data(ps.rawfilt), ps.alpha)
ps.dataA$reads <- sample_sums(ps)
ps.alpha.aov <- aov(Observed ~ time + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)## Df Sum Sq Mean Sq F value Pr(>F)
## time 1 0.82 0.825 0.173 0.6801
## reads 1 21.24 21.239 4.464 0.0430 *
## DNAconc 1 22.66 22.657 4.762 0.0371 *
## Residuals 30 142.72 4.757
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ps.alpha.aov <- aov(Shannon ~ time + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)## Df Sum Sq Mean Sq F value Pr(>F)
## time 1 0.0461 0.04607 0.469 0.499
## reads 1 0.0230 0.02300 0.234 0.632
## DNAconc 1 0.0076 0.00762 0.078 0.782
## Residuals 30 2.9465 0.09822
ps.alpha.aov <- aov(Simpson ~ time + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)## Df Sum Sq Mean Sq F value Pr(>F)
## time 1 0.0046 0.004638 0.299 0.589
## reads 1 0.0002 0.000152 0.010 0.922
## DNAconc 1 0.0006 0.000613 0.039 0.844
## Residuals 30 0.4660 0.015534
# Question 3: Do they change in response to the environment (the fruit)?
pl.alpha <- plot_richness(ps.rawfilt.q3, color = "fruit", x="fruit", measures=c("Observed","Shannon", "Simpson"))
pl.alpha.box <- pl.alpha + geom_boxplot()
plot(pl.alpha.box)# Let's run a simple ANOVA on alpha diversity indices
ps.alpha <- estimate_richness(ps.rawfilt.q3, measures=c("Observed","Shannon", "Simpson"))
ps.dataA <- cbind(sample_data(ps.rawfilt.q3), ps.alpha)
ps.dataA$reads <- sample_sums(ps.q3)
ps.alpha.aov <- aov(Observed ~ fruit + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)## Df Sum Sq Mean Sq F value Pr(>F)
## fruit 1 0.21 0.215 0.048 0.8275
## reads 1 21.33 21.327 4.806 0.0375 *
## DNAconc 1 16.96 16.958 3.822 0.0614 .
## Residuals 26 115.37 4.437
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ps.alpha.aov <- aov(Shannon ~ fruit + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)## Df Sum Sq Mean Sq F value Pr(>F)
## fruit 1 0.0691 0.06913 0.659 0.424
## reads 1 0.0253 0.02531 0.241 0.627
## DNAconc 1 0.0343 0.03433 0.327 0.572
## Residuals 26 2.7269 0.10488
ps.alpha.aov <- aov(Simpson ~ fruit + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)## Df Sum Sq Mean Sq F value Pr(>F)
## fruit 1 0.0146 0.014581 0.885 0.356
## reads 1 0.0006 0.000572 0.035 0.854
## DNAconc 1 0.0026 0.002550 0.155 0.697
## Residuals 26 0.4286 0.016484
We can all also combine all the comparisons into one single variable called “treatment” with the following code:
pl.alpha <- plot_richness(ps.rawfilt, color = "treatment", x="treatment", measures=c("Observed","Shannon", "Simpson"))
pl.alpha.box <- pl.alpha + geom_boxplot()
plot(pl.alpha.box)# Let's run a simple ANOVA on richness (the other metrics are non-linear)
ps.alpha <- estimate_richness(ps.rawfilt, measures=c("Observed","Shannon", "Simpson"))
ps.dataA <- cbind(sample_data(ps.rawfilt), ps.alpha)
ps.dataA$reads <- sample_sums(ps)
ps.alpha.aov <- aov(Observed ~ treatment + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)## Df Sum Sq Mean Sq F value Pr(>F)
## treatment 4 24.28 6.07 1.395 0.26221
## reads 1 36.11 36.11 8.301 0.00767 **
## DNAconc 1 9.58 9.58 2.201 0.14946
## Residuals 27 117.47 4.35
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let’s plot again the community compositions, but now aggregating samples by their treatment. You can also aggregate them according to other variables in the sample_data!
# Now let's do the same thing at the genus level, displaying only the 20 most abundant genera
# select top 10 most prevalent Families over the whole dataset
ps.gen <- tax_glom(ps.filt, taxrank="Genus")
Top10_gen <- names(sort(taxa_sums(ps.gen), TRUE)[1:10])
ps.gen <- prune_taxa(Top10_gen, ps.gen)
# Melt them for plotting
ps.gen.df <- psmelt(ps.gen)
# Plot each treatment group next to each other
ps.gen.df %>%
ggplot(aes_string(x = "Sample", y = "Abundance", fill = "Genus")) +
geom_bar(stat = "identity", position = "stack") +
theme_bw() +
facet_grid(~ inoculum, scales = "free_x", space = "free_x") +
theme(legend.position = "bottom", axis.text.x = element_text(angle = 80, vjust = 0.5)) +
ggtitle("Relative abundance at the Genus level between Philipp's and Vincent's kefir cultures")ps.gen.df %>%
ggplot(aes_string(x = "Sample", y = "Abundance", fill = "Genus")) +
geom_bar(stat = "identity", position = "stack") +
theme_bw() +
facet_grid(~ time, scales = "free_x", space = "free_x") +
theme(legend.position = "bottom", axis.text.x = element_text(angle = 80, vjust = 0.5)) +
ggtitle("Relative abundance at the Genus level between start and end cultures")ps.gen.q3 <- tax_glom(ps.filt.q3, taxrank="Genus")
Top10_gen <- names(sort(taxa_sums(ps.gen.q3), TRUE)[1:10])
ps.gen.q3 <- prune_taxa(Top10_gen, ps.gen.q3)
# Melt them for plotting
ps.gen.q3.df <- psmelt(ps.gen.q3)
ps.gen.q3.df %>%
ggplot(aes_string(x = "Sample", y = "Abundance", fill = "Genus")) +
geom_bar(stat = "identity", position = "stack") +
theme_bw() +
facet_grid(~ fruit, scales = "free_x", space = "free_x") +
theme(legend.position = "bottom", axis.text.x = element_text(angle = 80, vjust = 0.5)) +
ggtitle("Relative abundance at the Genus level between figs and raisins")As previously, we can all also combine all the comparisons into one single variable called “treatment” with the following code:
# Plot the proportions to better compare the communities
pl.com.pro <- plot_bar(ps.filt, x="id_sample",fill="Linnaeus") +
facet_wrap(~treatment, scales="free_x", nrow=1) +
theme(legend.title=element_blank(),legend.text = element_text(size = 8), legend.position="bottom")+
guides(color=guide_legend(ncol=6, byrow=FALSE))
pl.com.proNow we are going to quantify how (dis)similar our communities are. We will use two distance metrics, and visualize them with an ordination.
# Ordinate using Principal Coordinate Analysis & NMDS
ps.nmds.bray <- ordinate(ps.rawfilt, "NMDS", "bray",trymax=50)## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.1396999
## Run 1 stress 0.1421336
## Run 2 stress 0.1418913
## Run 3 stress 0.154612
## Run 4 stress 0.1391781
## ... New best solution
## ... Procrustes: rmse 0.1009367 max resid 0.2512856
## Run 5 stress 0.1498729
## Run 6 stress 0.1569684
## Run 7 stress 0.1414483
## Run 8 stress 0.1410611
## Run 9 stress 0.1500274
## Run 10 stress 0.1494211
## Run 11 stress 0.1441332
## Run 12 stress 0.1421798
## Run 13 stress 0.151331
## Run 14 stress 0.1579865
## Run 15 stress 0.1507755
## Run 16 stress 0.1410608
## Run 17 stress 0.1382355
## ... New best solution
## ... Procrustes: rmse 0.09276318 max resid 0.2476177
## Run 18 stress 0.1411982
## Run 19 stress 0.1422367
## Run 20 stress 0.1436699
## Run 21 stress 0.1421926
## Run 22 stress 0.1412372
## Run 23 stress 0.1410612
## Run 24 stress 0.1391757
## Run 25 stress 0.1522416
## Run 26 stress 0.1395976
## Run 27 stress 0.1410607
## Run 28 stress 0.1411982
## Run 29 stress 0.1387484
## Run 30 stress 0.1456364
## Run 31 stress 0.1395978
## Run 32 stress 0.1397051
## Run 33 stress 0.1412372
## Run 34 stress 0.1498728
## Run 35 stress 0.1533121
## Run 36 stress 0.1397057
## Run 37 stress 0.1434789
## Run 38 stress 0.143791
## Run 39 stress 0.1411981
## Run 40 stress 0.1447864
## Run 41 stress 0.1397004
## Run 42 stress 0.1411987
## Run 43 stress 0.1458232
## Run 44 stress 0.1534609
## Run 45 stress 0.1437985
## Run 46 stress 0.1459104
## Run 47 stress 0.1391795
## Run 48 stress 0.1441332
## Run 49 stress 0.1395977
## Run 50 stress 0.1503134
## *** Best solution was not repeated -- monoMDS stopping criteria:
## 49: stress ratio > sratmax
## 1: scale factor of the gradient < sfgrmin
# Check stress level and dimensions used
ps.nmds.bray##
## Call:
## metaMDS(comm = veganifyOTU(physeq), distance = distance, trymax = 50)
##
## global Multidimensional Scaling using monoMDS
##
## Data: wisconsin(sqrt(veganifyOTU(physeq)))
## Distance: bray
##
## Dimensions: 2
## Stress: 0.1382355
## Stress type 1, weak ties
## Best solution was not repeated after 50 tries
## The best solution was from try 17 (random start)
## Scaling: centring, PC rotation, halfchange scaling
## Species: expanded scores based on 'wisconsin(sqrt(veganifyOTU(physeq)))'
# Plot NMDS
pl.nmds.bray <- plot_ordination(ps.rawfilt, ps.nmds.bray,color="treatment",title="NDMS of Bray-Curtis dissimilarities",label="id_sample" ) +
stat_ellipse(aes(group = inoculum))
pl.nmds.brayAnd let’s test if the differences in the composition of communities from the same group are smaller than between groups using a Permutational Multivariate ANOVA (PerMANOVA)
# Calculate Bray-Curtis distance
ps.bray <- phyloseq::distance(ps.rawfilt,"bray")
# Test with ADONIS
ps.adonis.int <- adonis2(ps.bray ~ reads + (inoculum*fruit), data = ps.dataA, perm =9999)
ps.adonis.int